ConDist: A Context-Driven Categorical Distance Measure
Authors
Abstract
A distance measure between objects is a key requirement for many data mining tasks such as clustering, classification, and outlier detection. For objects characterized by categorical attributes, however, defining a meaningful distance measure is challenging because the values within such attributes have no inherent order, especially when no additional domain knowledge is available. In this paper, we propose an unsupervised distance measure for objects with categorical attributes, based on the idea that categorical attribute values are similar if they appear with similar value distributions on correlated context attributes. The distance measure is thus derived automatically from the given data set. We compare our distance measure to existing categorical distance measures and evaluate it on several data sets from the UCI machine-learning repository. The experiments show that our distance measure achieves similar or better results than previous approaches and does so more robustly.
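The core idea can be illustrated with a minimal sketch. It is not the ConDist formula, which the abstract does not give: it assumes a single context attribute, uses the Euclidean distance between conditional distributions as a stand-in, and omits any weighting by attribute correlation. The function names and the toy data are hypothetical.

```python
from collections import Counter, defaultdict
import math


def conditional_distributions(rows, target, context):
    """Estimate P(context value | target value) from a list of dict rows."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row[target]][row[context]] += 1
    return {
        value: {c: n / sum(ctr.values()) for c, n in ctr.items()}
        for value, ctr in counts.items()
    }


def value_distance(dist, a, b):
    """Euclidean distance between the context distributions of values a and b.

    Assumption: Euclidean distance is an illustrative choice only.
    """
    keys = set(dist[a]) | set(dist[b])
    return math.sqrt(
        sum((dist[a].get(k, 0.0) - dist[b].get(k, 0.0)) ** 2 for k in keys)
    )


# Toy data: "color" is the target attribute, "fruit" the context attribute.
rows = [
    {"color": "red", "fruit": "apple"},
    {"color": "red", "fruit": "cherry"},
    {"color": "crimson", "fruit": "apple"},
    {"color": "crimson", "fruit": "cherry"},
    {"color": "yellow", "fruit": "banana"},
    {"color": "yellow", "fruit": "banana"},
]

dist = conditional_distributions(rows, target="color", context="fruit")
d_close = value_distance(dist, "red", "crimson")  # identical context distributions
d_far = value_distance(dist, "red", "yellow")     # disjoint context distributions
```

In this toy data, "red" and "crimson" co-occur with the same fruits in the same proportions, so their distance is zero, while "red" and "yellow" share no context values and end up far apart.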
Similar resources
Automatic Threshold Calculation for the Categorical Distance Measure ConDist
The measurement of distances between objects described by categorical attributes is a key challenge in data mining. The unsupervised distance measure ConDist approaches this challenge based on the idea that categorical values within an attribute are similar if they occur with similar value distributions on correlated context attributes. An impact function controls the influence of the correlate...
Similarity Measures for Categorical Data: A Comparative Evaluation
Measuring similarity or distance between two entities is a key step for several data mining and knowledge discovery tasks. The notion of similarity for continuous data is relatively well-understood, but for categorical data, the similarity computation is not straightforward. Several data-driven similarity measures have been proposed in the literature to compute the similarity between two catego...
Distance based Clustering for Categorical Data
Learning distances from categorical attributes is a very useful data mining task that makes it possible to apply distance-based techniques such as clustering and classification by similarity. In this article we propose a new context-based similarity measure that learns distances between the values of a categorical attribute (DILCA, DIstance Learning of Categorical Attributes). We couple our similarity m...
Making the Nearest Neighbor Meaningful
The nearest-neighbor problem arises in clustering and other applications. It requires us to define a function to measure differences among items in a data set, and then to compute the closest items to a query point with respect to this measure. Recent work suggests that the conventional Euclidean measure does not adequately model high-dimensional data. We present a new, data-driven difference me...
Context-Based Distance Learning for Categorical Data Clustering
Clustering data described by categorical attributes is a challenging task in data mining applications. Unlike numerical attributes, it is difficult to define a distance between pairs of values of the same categorical attribute, since they are not ordered. In this paper, we propose a method to learn a context-based distance for categorical attributes. The key intuition of this work is that the d...
Publication date: 2015